Statistik für Biologie FS 2022
Solutions
Solutions to the exercises will be posted here each week.
Week 1
Exercise 1
No. Collectors did not choose randomly, but preferred rarer types over more common ones.
It is a sample of convenience.
There is bias in every year’s sample because rare types are over-represented. Furthermore, this bias might change across years as the frequencies of the two morphs have changed over time.
Exercise 2
Discrete.
Technically it is a discrete variable (because fractions can be enumerated / are restricted to finitely many values in a finite sample) but it makes sense to treat it as a continuous variable if the sample is very large.
Discrete. There are no half crimes.
Continuous. Body mass is continuous and hence the log as well.
Exercise 3
E(xplanatory): Altitude (categorical: high vs low) R(esponse): Growth rate S(tudy type): Observational
E: Treatment (standard vs. tasploglutide) R: Rate of insulin release S: Experimental
E: Health status (schizophrenia vs. healthy) R: Frequency of illegal drug use S: Observational
E: Number of legs R: Survival probability S: Experimental
E: Treatment (advanced communication therapy vs. social visits without formal therapy) R: Communication ability S: Experimental
Exercise 4
The main problem is a strong bias in the sample. To see this, consider the sample of planes that were used in this study. Only planes that were not hit in critical areas were available to estimate the distribution of bullet holes. Planes that were hit in a critical area, i.e., one that leads to a crash, were not available because they did not return to base. With this knowledge, it becomes clear that it would have been better to reinforce the areas where no, or very few, bullet holes were found, namely the cockpit and engine.
Exercise 5
The population is all the small mammals of Kruger National Park; the parameter being estimated is the number of species in this population.
No, the sample is not likely to be random. In a random sample, every individual has the same chance of being selected. But some small mammals might be easier to trap than others (for example, trapping only at night might miss all the mammals active only in the day time). In a random sample individuals are selected independently. Multiple animals caught in the same trap might not be independent if they are related or live near one another (this is harder to judge).
The number of species in the sample might underestimate the number in the Park if sampling was not random (e.g., if daytime mammals were missed), or if rare species happened to avoid capture. Even if the sample is perfectly random, the estimator will underestimate the true number in most cases. You can sample fewer species just by chance, but never more species than there actually are. Thus, on average, you will underestimate the true number.
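This "underestimation on average" can be illustrated with a small simulation (a sketch, not part of the original exercise; the community of 10 species with made-up, unequal abundances is hypothetical):

```r
set.seed(1)
# hypothetical community: 10 species with unequal abundances
abundances = c(500, 200, 100, 50, 25, 10, 5, 3, 2, 1)
community = rep(1:10, times = abundances)

# draw many random samples of 50 individuals and count the species in each
n.species = replicate(1000, length(unique(sample(community, 50))))
mean(n.species)  # on average below the true richness of 10
```

The sample can miss rare species by chance but can never contain more than the 10 species that exist, so the average sampled richness lies below the truth.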
Week 2
Exercise 6
You should see a blank plot. This is what ggplot() does: it simply creates a “canvas” to which you can add “layers”.
How many rows are in mpg? How many columns?
## tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
## $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
## $ model : chr [1:234] "a4" "a4" "a4" "a4" ...
## $ displ : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
## $ trans : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
## $ drv : chr [1:234] "f" "f" "f" "f" ...
## $ cty : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : chr [1:234] "p" "p" "p" "p" ...
## $ class : chr [1:234] "compact" "compact" "compact" "compact" ...
There are 234 observations of 11 variables. Thus, there are 11 columns and 234 rows. We can confirm this by looking at the dimensions of the underlying data matrix:
## [1] 234 11
drv is a categorical variable indicating the drive type: f = front-wheel drive, r = rear-wheel drive, 4 = four-wheel drive
ggplot(data = mpg) +
geom_point(mapping = aes(y = hwy,x = cyl)) +
theme_classic() +
labs(title = "A scatterplot", y = "miles per gallon on highways", x = "cylinders")

ggplot(data = mpg) +
geom_point(mapping = aes(x = class,y = drv)) +
theme_classic() +
labs(title = "A useless plot", x = "class", y = "drive type")

Both drv and class are categorical variables and it does not make sense to visualize them this way. What would be a better way to show these data?
- Color is used within the aes() function, where we specify which variables should be mapped. “blue” is, however, not a variable in our data frame. ggplot therefore treats it as a categorical variable with the single value “blue” and assigns it the first default color, so the points are not actually blue. To color all points blue, color = "blue" must be placed outside of aes(). The following code creates the correct plot:
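A sketch of the corrected call (assuming, as in the usual version of this exercise, that displ and hwy were plotted; note that color = "blue" now sits outside aes()):

```r
library(ggplot2)

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
```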
- A continuous variable will lead to a continuous color gradient rather than a discrete set of colors.
- The “+” symbol always has to be placed at the end of a line, never at the beginning of the next one:
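A minimal illustration: the first call works, while the second fails because R treats the first line as a complete statement before it ever sees the “+”.

```r
library(ggplot2)

# correct: "+" ends the line, so R knows the statement continues
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# wrong: the plot is created without the point layer, and the
# second line throws an error
# ggplot(data = mpg)
#   + geom_point(mapping = aes(x = displ, y = hwy))
```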
Week 3
Exercise 7
genes <- read.csv(url("http://www.zoology.ubc.ca/~schluter/WhitlockSchluter/wp-content/data/chapter04/chap04e1HumanGeneLengths.csv"))
# a)
n = dim(genes)[1]
print(n)
## [1] 20290
# b)
sum.gene.lengths = sum(genes$geneLength)
# c)
mean.gene.length1 = sum.gene.lengths/n
mean.gene.length2 = mean(genes$geneLength)
print(mean.gene.length1)
## [1] 2622.027
## [1] 2622.027
## [1] 223678143206
# e)
# by hand:
variance.gene.length = sum((genes$geneLength - mean.gene.length1)^2)/(n-1)
print(variance.gene.length)
## [1] 4149235
## [1] 4149235
## [1] 2036.967
# or
sd.gene.length = sqrt(variance.gene.length)
# g)
cv.gene.length = sd.gene.length/mean.gene.length1
print(cv.gene.length)
## [1] 0.7768672
# h)
# Which one do you like best?
ggplot(data = genes) +
geom_histogram(mapping = aes(x = geneLength),color="black",fill="darkred") +
theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = genes) +
geom_density(mapping = aes(x = geneLength),color="black",fill="darkred") +
xlab("Gene length") +
ylab("Density") +
scale_x_log10() +
theme_classic()

ggplot(data = genes) +
geom_boxplot(mapping = aes(y = geneLength),color="black",fill="darkred") +
theme_classic() +
coord_flip()

Week 4
Exercise 8
# load the data file:
stickleback <- read.csv(url("http://www.zoology.ubc.ca/~schluter/WhitlockSchluter/wp-content/data/chapter03/chap03e3SticklebackPlates.csv"))
# one possible solution:
mean.mm = mean(stickleback$plates[stickleback$genotype=="mm"])
mean.mM = mean(stickleback$plates[stickleback$genotype=="Mm"])
mean.MM = mean(stickleback$plates[stickleback$genotype=="MM"])
median.mm = median(stickleback$plates[stickleback$genotype=="mm"])
median.mM = median(stickleback$plates[stickleback$genotype=="Mm"])
median.MM = median(stickleback$plates[stickleback$genotype=="MM"])
par(mfrow = c(1,3))
hist(stickleback$plates[stickleback$genotype=="MM"],
main = "MM",xlab = "number of plates",
col="darkred")
abline(v = mean.MM,col="black")
abline(v = median.MM,col="blue")
hist(stickleback$plates[stickleback$genotype=="Mm"],
main = "Mm",xlab = "number of plates",
col="darkred")
abline(v = mean.mM,col="black")
abline(v = median.mM,col="blue")
hist(stickleback$plates[stickleback$genotype=="mm"],
main = "mm",xlab = "number of plates",
col="darkred")
abline(v = mean.mm,col="black")
abline(v = median.mm,col="blue")

# alternatively, we can first create a data frame
# with the means per genotype
# check out the help function of ddply
# (this is only one way to do this, you can also do it by
# "hand" or using other functions such as tapply) :
library(plyr)
df.mean = ddply(stickleback, "genotype", summarize, m.number = mean(plates))
df.median = ddply(stickleback, "genotype", summarize, m.number = median(plates))
# then we plot it with a single ggplot
ggplot() +
geom_histogram(data = stickleback, mapping = aes(x=plates),binwidth=5,color="black",fill="darkred") +
geom_vline(data = df.mean, aes(xintercept=m.number),col="black",size=1) +
geom_vline(data = df.median, aes(xintercept=m.number),col="blue",size = 1) +
facet_wrap(~ genotype) +
ylab("frequency") +
theme_classic()
Exercise 9
The code simulates the sampling distribution of the mean. Here is an alternative version for the Poisson distribution:
# We choose the Poisson distribution with mean lambda = 50
lambda = 50
# We do 1000 replicates
reps = 1000
# We choose the sample size n to be 10
sample_size = 10
# the means of each sample will be stored in this vector
means = vector("numeric",reps)
# in this loop, we calculate the mean of each sample
for (i in 1:reps)
{
# take a sample of the poisson distribution
sample = rpois(sample_size,lambda)
# store the mean in our vector
means[i] = mean(sample)
}
# a histogram - note that we set freq = FALSE to
# show relative frequencies
hist(means,freq=F,main="The sampling distribution of the mean")
# we specify the x values for the theoretical poisson
x = seq(0,3*lambda,by=0.1)
# we calculate the standard error:
SE = sqrt(lambda)/sqrt(sample_size)
# calculate the corresponding normal approximation (CLT)
y = dnorm(x,lambda,SE)
lines(x,y)

Week 5
Exercise 10
- What is the probability that a randomly drawn slice has pepperoni on it?
\[P(pepperoni) = 5/8 = 0.625 = 62.5 \%\]
- What is the probability that a randomly drawn slice has both pepperoni and anchovies on it?
\[P(pepperoni \quad and \quad anchovies) = 2/8 = 0.25 = 25 \%\]
- What is the probability that a randomly drawn slice has either pepperoni or anchovies on it?
\[P(pepperoni \quad or \quad anchovies) = 7/8 = 0.875 = 87.5\%\]
- Are pepperoni and anchovies mutually exclusive on this pizza?
No. There are two slices with both pepperoni and anchovies.
- Are olives and mushrooms mutually exclusive on this pizza?
Yes. There is no slice that has both olives and mushrooms on it.
- Are getting mushrooms and getting anchovies independent when randomly picking a slice of pizza?
No, because
\[P(anchovy) = 4/8 = 0.5\]
\[P(mushroom) = 3/8 = 0.375\]
and
\[P(mushroom \quad and \quad anchovy) = 1/8\]
but
\[P(mushroom)*P(anchovy) = 3/8 * 4/8 = 3/16 = 0.1875 = 18.75 \% .\]
- If I pick a slice from this pizza and tell you that it has olives on it, what is the chance that it also has anchovies?
\[P(anchovy \mid olive) =\] \[P(anchovy \quad and \quad olive)/P(olive) = (1/8) / (2/8) = 1/2 = 0.5 = 50\%\]
- If I pick a slice from this pizza and tell you that it has anchovies on it, what is the chance that it also has olives?
\[P(olive \mid anchovy) =\] \[P(olive \quad and \quad anchovy)/P(anchovy) = (1/8) / (4/8) = 1/4 = 0.25 = 25\%\]
- Seven of your friends each choose a slice at random and eat them without telling you what toppings they had. What is the chance that the last slice has olives on it?
\[P(last \quad slice \quad has \quad olives) = 2/8 = 0.25 = 25\%\].
This can be seen directly because each slice has the same probability to be the last slice.
- You choose two slices at random. What is the chance that they have both olives on them? (Be careful: after removing the first slice, the probability of choosing one of the remaining slices changes.)
\[P(first \, slice \,has \, olives) = 2/8\]
\[P(second \, slice \, also \, has \, olives) = 1/7\]
\[P(both \, slices \, have \, olives) = 2/8 * 1/7 = 1/28 \approx 3.6\% \]
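The result can be checked with a quick simulation (a sketch; slices are encoded as 1 = olives, 0 = no olives):

```r
set.seed(42)
pizza = c(1, 1, 0, 0, 0, 0, 0, 0)  # 2 of the 8 slices have olives

# draw two slices without replacement, many times
both.olives = replicate(100000, sum(sample(pizza, 2)) == 2)
mean(both.olives)  # close to 1/28 = 0.0357
```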
- What is the probability that a randomly chosen slice does not have pepperoni on it?
\[P(no \, pepperoni) = 1 - P(pepperoni) = 3/8 = 0.375 = 37.5\%\]
- Draw a pizza for which mushrooms, olives, anchovies and pepperoni are all mutually exclusive
A pizza for which mushrooms, olives, anchovies and pepperoni are all mutually exclusive
Exercise 11
- What is the probability that you picked up no dangerous snakes?
\[P(no \, dangerous \, snakes) = \] \[ P(no \, dangerous \, snake \, in \, left \, hand) P(no \, dangerous \, snake \, in \, right \, hand)\] \[= 3/8 * 2/7 = 6/56 = 0.107 = 10.7 \%\]
- Assume that any dangerous snake that you pick up has a probability of 0.8 of biting you. The defanged snakes do not bite. What is the chance that, in picking up your two snakes, you are bitten at least once?
\[P(bite) = P(bite \mid 0 \, dangerous \, snakes) P(0 \, dangerous \, snakes)\] \[+ P(bite \mid 1 \, dangerous \, snake) P(1 \, dangerous \, snake) \] \[+P(bite \mid 2 \, dangerous \, snakes) P(2 \, dangerous \, snakes)\]
\[P(0 \, dangerous \, snakes) = 0.107\] (from (a)) \[P(1 \, dangerous \, snake) = (5/8 * 3/7) + (3/8 * 5/7) = 0.536\] \[P(2 \, dangerous \, snakes) = 5/8 * 4/7 = 0.357\] \[P(bite \mid 0 \, dangerous \, snakes) = 0\] \[P(bite \mid 1 \, dangerous \, snake) = 0.8\] \[P(bite \mid 2 \, dangerous \, snakes) = 1- P(no \, bite \mid 2 \, dangerous \, snakes) = 1-(1- 0.8)^2 = 0.96\]
Putting all this together: \[P(bite) = (0*0.107)+(0.8*0.536)+(0.96*0.357) = 0.772\]
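As a sanity check, this calculation can be verified by simulation (a sketch; snakes are encoded as 1 = dangerous, 0 = defanged):

```r
set.seed(3)
snakes = c(rep(1, 5), rep(0, 3))  # 5 dangerous, 3 defanged

bitten = replicate(100000, {
  picked = sample(snakes, 2)      # pick two snakes without replacement
  any(runif(2) < 0.8 * picked)    # each dangerous snake bites with p = 0.8
})
mean(bitten)  # close to 0.772
```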
- If you picked up one snake and it didn’t bite you, what is the probability that it is defanged?
\[P(defanged \mid no \, bite)= \left[P(no \, bite \mid defanged)P(defanged)\right]/P(no \, bite)\] \[P(no \, bite \mid defanged)=1\] \[P(defanged) = 3/8\]
$$P(no \, bite) = P(defanged)P(no bite \mid defanged) $$
$$+ P(dangerous)P(no \, bite \mid dangerous) = (3/8 * 1) + (5/8 * (1-0.8)) = 0.5$$
Therefore
$$P(defanged \mid no \, bite) = (1*3/8 )/ (0.5) = 0.75 = 75\%.$$
Exercise 12
- What is the probability that all five researchers have calculated an interval that includes the true value of the parameter?
\(0.95^5 = 0.774\)
- What is the probability that at least one does not include the true parameter value.
\(1- 0.95^5 = 0.226\)
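In R, both values can be computed directly:

```r
0.95^5      # P(all five intervals cover the true value) = 0.7737809
1 - 0.95^5  # P(at least one interval misses) = 0.2262191
```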
Week 6
Exercise 14
We first choose values for \(\mu\) and \(\sigma\) (you can choose any values you want)
Next we choose the values for the x axis. A good choice is to cover the range from \(\mu - 4 \sigma\) to \(\mu + 4 \sigma\). We choose a step size of 0.1:
Now we compute the corresponding \(y\) values for our plot using the function dnorm:
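The steps above might look like this in code (the values \(\mu = 2\) and \(\sigma = 0.5\) are an arbitrary choice; the vectors x and y are then used for plotting below):

```r
mu = 2
sigma = 0.5
# x values covering mu - 4*sigma to mu + 4*sigma in steps of 0.1
x = seq(mu - 4*sigma, mu + 4*sigma, by = 0.1)
# corresponding density values of the Normal(mu, sigma) distribution
y = dnorm(x, mean = mu, sd = sigma)
```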
I now do the same thing for the standard normal distribution:
mu.std = 0
sigma.std = 1
x.std = seq(mu.std-4*sigma.std,mu.std+4*sigma.std,by=0.1)
y.std = dnorm(x.std, mean = 0, sd = 1)

And finally we plot both density functions next to each other:
par(mfrow = c(1,2))
plot(x,y,type="l",xlab="x",ylab="density")
plot(x.std,y.std,type="l",xlab="x",ylab="density")

We see that both curves look exactly the same. The only difference is the scale of the x and y axes.
- Calculate the 10 percent quantile of the distribution. Store the value in a variable \(Q10\). Then calculate the proportion of numbers in your random sample from (b) that are smaller than or equal to \(Q10\). What do you find?
## [1] 0.095
We find that approximately 10 percent of our random sample is smaller than \(Q10\). This is exactly the definition of a quantile!
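The underlying computation might have looked like this (a sketch; mu and sigma are the values chosen at the start of the exercise, and random.sample stands for the sample drawn in part (b)):

```r
mu = 2
sigma = 0.5
# 10 percent quantile of the Normal(mu, sigma) distribution
Q10 = qnorm(0.1, mean = mu, sd = sigma)
# the random sample from part (b)
random.sample = rnorm(1000, mean = mu, sd = sigma)
# proportion of the sample at or below Q10 - roughly 0.1
mean(random.sample <= Q10)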
Exercise 15
We find that \(p = C\). This means that the cumulative distribution function is the inverse of the quantile function.
- The p-quantile \(q_p\) is defined as the number such that
\[P(X < q_p) = p.\] We define a quantile function \(Q(p) = q_p\). The cumulative distribution function \(F(x)\) gives us the value \[F(x) = P(X<x)\] Then \(F(q_p) = P(X < q_p) = p\)
and because
\[Q(p) = q_p\] it follows
that
\[F(Q(p)) = F(q_p) = p.\] Here is a graphical representation of this relationship:
cdf.std = pnorm(x.std,mean=0,sd=1)
plot(x.std,cdf.std,type="l",xlab="x",ylab = "P(X<x) ")
abline(v=qnorm(0.1,mean=0,sd=1),lwd=2,col="dodgerblue")
abline(h = 0.1,col="slategray",lwd=2)
text(-1.8,0.3,expression(Q[10]),col="dodgerblue")
text(-3.3,0.2,expression(P(X < Q[10])==0.1),col="slategray")

Week 7
Exercise 16
Plot the probability density function of a Normal distribution.
- Indicate the smallest 5 percent of the distribution. In other words: Let \(X\) be a normally distributed random variable. For which range \((-\infty,c_1]\) do we get \(P(X \leq c_1) = 0.05\)?
mu.std = 0
sigma.std = 1
x.std = seq(mu.std-4*sigma.std,mu.std+4*sigma.std,by=0.1)
y.std = dnorm(x.std, mean = 0, sd = 1)
plot(x.std,y.std,type="l")
c1 = qnorm(0.05,0,1)
abline(v = c1)

- Indicate the most extreme 5 percent of the distribution. In other words: For which range \((-\infty,-c_2] \cup [c_2,\infty)\) do we get \(P(X \leq -c_2 \text{ or } X \geq c_2 ) = 0.05\)? [Hint: you can use the quantile function of \(R\) and the symmetry of the Normal distribution to simplify things]
mu.std = 0
sigma.std = 1
x.std = seq(mu.std-4*sigma.std,mu.std+4*sigma.std,by=0.1)
y.std = dnorm(x.std, mean = 0, sd = 1)
plot(x.std,y.std,type="l")
c2 = qnorm(c(0.025,0.975),0,1)
abline(v=c2)

set.seed(1984)
n = 1000
random.sample = rnorm(n,0,1)
sum(random.sample < c2[1] | random.sample > c2[2])/n
## [1] 0.052
plot(x.std,y.std,type="l")
rug(random.sample)
rug(random.sample[random.sample < c2[1] | random.sample > c2[2]],col="red")

We see that roughly 5 percent fall into the extreme tails of the distribution. This is sensible, because we have defined the extreme part here as the range \((-\infty,-c_2] \cup [c_2,\infty)\), where \(-c_2\) (\(c_2\)) is the 2.5 (97.5) percentile of the distribution.
Exercise 17
We first calculate the mean for each genotype:
stickleback <- read.csv(url("http://www.zoology.ubc.ca/~schluter/WhitlockSchluter/wp-content/data/chapter03/chap03e3SticklebackPlates.csv"))
mean.mm = mean(stickleback$plates[stickleback$genotype=="mm"])
mean.Mm = mean(stickleback$plates[stickleback$genotype=="Mm"])
mean.MM = mean(stickleback$plates[stickleback$genotype=="MM"])

Next we compute the lower and upper limits of the 95% CI using t.test():
t.t.mm = t.test(stickleback$plates[stickleback$genotype=="mm"])
t.t.Mm = t.test(stickleback$plates[stickleback$genotype=="Mm"])
t.t.MM = t.test(stickleback$plates[stickleback$genotype=="MM"])
CI.mm = t.t.mm$conf.int
CI.Mm = t.t.Mm$conf.int
CI.MM = t.t.MM$conf.int

We put it all in a data frame:
df = data.frame(means = c(mean.mm,mean.Mm,mean.MM),
lower = c(CI.mm[1],CI.Mm[1],CI.MM[1]),
upper = c(CI.mm[2],CI.Mm[2],CI.MM[2]),
genotypes = c("mm","Mm","MM"))

Now we can plot everything:
ggplot(data = df) +
geom_errorbar(mapping = aes(x = genotypes,ymin=lower, ymax=upper), width=.1) +
geom_point(mapping = aes(x = genotypes,y=means))

If you prefer (I don’t), you can also do a bar plot:
ggplot(data = df) +
geom_col(mapping = aes(x = genotypes,y=means,fill=genotypes)) +
geom_errorbar(mapping = aes(x = genotypes,ymin=lower, ymax=upper), width=.1) +
geom_point(mapping = aes(x = genotypes,y=means))

Either way, it is advised to also add the raw data:
ggplot(data = df) +
geom_jitter(data = stickleback,mapping = aes(x = genotype,y = plates,color=genotype),size=0.8,width=0.05)+
geom_errorbar(mapping = aes(x = genotypes,ymin=lower, ymax=upper), width=.2) +
geom_point(mapping = aes(x = genotypes,y=means))

In part (a) we used a one-sample t-test to calculate the confidence intervals. This test also tested the null hypothesis that the mean of each group is 0. If we want to compare means between groups, we need to do a two-sample t-test:
t.test(stickleback$plates[stickleback$genotype=="mm"],
stickleback$plates[stickleback$genotype=="Mm"])

##
## Welch Two Sample t-test
##
## data: stickleback$plates[stickleback$genotype == "mm"] and stickleback$plates[stickleback$genotype == "Mm"]
## t = -32.001, df = 208.06, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -41.09355 -36.32416
## sample estimates:
## mean of x mean of y
## 11.67045 50.37931
t.test(stickleback$plates[stickleback$genotype=="mm"],
stickleback$plates[stickleback$genotype=="MM"])

##
## Welch Two Sample t-test
##
## data: stickleback$plates[stickleback$genotype == "mm"] and stickleback$plates[stickleback$genotype == "MM"]
## t = -95.49, df = 167.89, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -52.16670 -50.05336
## sample estimates:
## mean of x mean of y
## 11.67045 62.78049
t.test(stickleback$plates[stickleback$genotype=="Mm"],
stickleback$plates[stickleback$genotype=="MM"])

##
## Welch Two Sample t-test
##
## data: stickleback$plates[stickleback$genotype == "Mm"] and stickleback$plates[stickleback$genotype == "MM"]
## t = -10.262, df = 207.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -14.78364 -10.01871
## sample estimates:
## mean of x mean of y
## 50.37931 62.78049
We find that all t-tests yield a significant result (small p-values). This means that the null hypothesis that the means of two groups are the same can be rejected for all comparisons.
A shorter way is to use the function pairwise.t.test. The first argument is the variable on which the t-tests should be performed. The second argument is the grouping factor: the categorical variable that separates the first argument into different groups.
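The call that produced the output below is simply:

```r
pairwise.t.test(stickleback$plates, stickleback$genotype)
```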
##
## Pairwise comparisons using t tests with pooled SD
##
## data: stickleback$plates and stickleback$genotype
##
## mm Mm
## Mm < 2e-16 -
## MM < 2e-16 1.5e-15
##
## P value adjustment method: holm
How does this finding relate to the confidence intervals? The confidence intervals are not overlapping, strongly suggesting that the means between groups are indeed different (at the 95 percent level). Thus, the t-test and the confidence interval indicate the same result.